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The work presented in this paper describes the application of a multiblock gridding strategy to 
the solution of aerodynamic design optimization problems involving complex configurations. The 
design process is parallelized using the MPI (Message Passing Interface) Standard such that it 
can be efficiently run on a variety of distributed memory systems ranging from traditional parallel 
computers to networks of workstations. Substantial improvements to the parallel performance of 
the baseline method are presented, with particular attention to their impact on the scalability of 
the program as a function of the mesh size. Drag minimization calculations at a fixed coefficient of 
lift are presented for a business jet configuration that includes the wing, body, pylon, aft-mounted 
nacelle, and vertical and horizontal tails. An aerodynamic design optimization is performed with 
both the Euler and Reynolds Averaged Navier-Stokes (RANS) equations governing the flow solu- 
tion and the results are compared. These sample calculations establish the feasibility of efficient 
aerodynamic optimization of complete aircraft configurations using the RANS equations as the flow 
model. There still exists, however, the need for detailed studies of the importance of a true viscous 
adjoint method which holds the promise of tackling the minimization of not only the wave and 
induced components of drag, but also the viscous drag. 


INTRODUCTION 

During the course of the last few years, there has 
been a concentrated effort within our group to de- 
velop fast and efficient methods for the solution of 
viscous fluid flows over complex aircraft configura- 
tions. The path to achieve this goal has seen nu- 
merous improvements in convergence acceleration 
techniques (multigrid, implicit residual averaging), 
viscous discretization algorithms, higher order dis- 
sipation schemes for shock capturing and boundary 
layer resolution, unstructured and multiblock grid 
approaches, and parallel implementations based on 
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domain decomposition ideas [30, 50, 43, 49, 2, 1], 
The combination of all these factors has resulted in 
a variety of flow solvers that adhere to the highest 
standards of accuracy and efficiency. 

Although direct flow analysis of existing configu- 
rations has in the past provided the aircraft de- 
signer with invaluable information to overcome a 
wide range of problems, there is the need for Com- 
putational Fluid Dynamics (CFD) methods which, 
in addition, provide information about geometry 
changes that are necessary to improve an existing 
design with respect to a pre-specified figure of merit. 
It is within this framework where the utilization of 
fast solution techniques is of utmost importance. 

Most effective aerodynamic optimization methods 
are based on the calculation of the derivatives of the 
figure of merit with respect to the design variables 
in the problem. Unfortunately, the calculation of 
these derivatives using the straightforward method 
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of finite differencing is usually prohibitively expen- 
sive. While it is possible to perform aerodynamic 
optimization on a limited class of problems using 
the finite difference approach [19, 18, 51, 44, 42], 
large scale problems that are of the greatest engi- 
neering interest do not belong to this class. An al- 
ternative technique first suggested for problems in- 
volving partial differential equations by Lions [40], 
and extended for the treatment of compressible flow 
by Jameson [23, 24] is the control theory approach. 
This methodology employs control theory applied to 
systems governed by partial differential equations to 
derive a co-state or adjoint system of equations. This 
adjoint equation has similar complexity to the flow 
solution, and allows the calculation of the complete 
gradient of the figure of merit with a cost which is 
essentially independent of the number of design vari- 
ables in the problem. 

The computational cost of performing viscous based 
design is considerably larger than for design using 
the Euler equations because: a) the number of mesh 
points must be increased by a factor of about five to 
resolve boundary layers and wakes, b) there is the 
additional cost of computing the viscous terms and a 
turbulence model, and c) Navier-Stokes calculations 
generally converge much more slowly than Euler so- 
lutions because of stiffness arising from the highly 
stretched boundary layer cells. Therefore, the com- 
putational feasibility of viscous design hinges on the 
development of a rapidly convergent Navier-Stokes 
flow solver which is able to handle complex configu- 
rations and is efficiently implemented on the current 
generation of distributed memory architectures. 

With this in mind, the logical approach to the solu- 
tion of the aerodynamic design problem is to link to- 
gether fast iterative solvers and the adjoint solution 
methodology in order to produce a computational 
method which can address the needs of the aircraft 
designer: high solution accuracy, fast turnaround, 
geometric complexity, and automated shape design. 
Even with the use of an adjoint solver, large scale 
design problems using the Navier-Stokes equations 
that are considered in this work require massive com- 
putational resources. Future work will place even 
more extreme demands on the computational power 
needed. Therefore, it was decided to attempt to ex- 
ploit the power of emerging distributed memory par- 
allel computers with efficient standardized message- 
passing implementations. Thus, much emphasis in 
this paper has been placed, not only on demonstrat- 
ing the viability of performing automatic designs on 
complex configurations, but also on minimizing the 
communication overhead incurred by mapping the 
method onto either parallel computers or clusters of 
workstations. 

In this paper we present one of the possible varia- 


tions of the adjoint based design technique for com- 
plex geometries, where the flow and adjoint solvers 
have been implemented using a multiblock strat- 
egy. Both the Euler and Reynolds Averaged Navier- 
Stokes equations are used to solve drag minimiza- 
tion problems. In both circumstances, the adjoint 
system solved to obtain the sensitivity of the fig- 
ure of merit with respect to the design variables is 
based on the inviscid equations only. The authors 
feel that the effective use of a viscous adjoint in a re- 
alistic design environment will require much further 
development and validation work. Furthermore, the 
approach presented here is a natural evolution of our 
previous work [50, 43, 49] which had already been 
extended to treat inviscid flows over complex config- 
urations. The work directed towards the improve- 
ment of the parallel performance of the method was 
motivated by an interest in demonstrating the per- 
formance of the method on computational platforms 
which do not possess the bandwidth and latency that 
is realizable on highly integrated parallel machines. 


CONTROL THEORY FORMULATION 
FOR SHAPE DESIGN 

The presentation of the control theory approach to 
optimal design is well documented elsewhere [23, 28], 
and only a brief summary is given here. 

The progress of the design procedure is measured in 
terms of a cost function I, which could be, for exam- 
ple, the drag coefficient or the lift to drag ratio. For 
the flow about an airfoil or wing, the aerodynamic 
properties which define the cost function are func- 
tions of the flow-field variables ( w ) and the physical 
location of the boundary, which may be represented 
by the function J 7 , say. Then 

I = I(w,F), 

and a change in T results in a change 


.. dI T x dF 

SI = ss Sw + w** 


(1) 


in the cost function. Using control theory, the gov- 
erning equations of the flow field are introduced as a 
constraint in such a way that the final expression for 
the gradient does not require multiple flow solutions. 
This corresponds to eliminating 8w from (1). 

Suppose that the governing equation R which ex- 
presses the dependence of w and T within the flow- 
field domain D can be written as 

R(w,F) = 0. (2) 

In our current work, R may be expressed by either 
the Euler or Navier-Stokes equations. Then 8w is 
determined from the equation 


8R = 


dR' 

dw 


8w + 


dR' 

dT 


5 ^ = 0 . 


( 3 ) 
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Next, introducing a Lagrange multiplier ip, we have MULTIBLOCK FLOW SOLVER 



Choosing ip to satisfy the adjoint equation 


Multiblock strategy 

FLO107-MB is a three-dimensional, multiblock, Eu- 
ler and Navier-Stokes flow solver suitable for the so- 
lution of external and internal flows around complex 
configurations. The discretization of the governing 
equations of the flow is accomplished using a cell- 
centered finite volume method. The flow domain is 
divided into a large number of small subdomains, 
and the integral form of the conservation laws 


&R1 T _ dl_ 
dw dw 


(4) 


d_ 

dt 


/ w dV + F 
Jb 


• dS = 0 


the first term is eliminated, and we find that 


SI = QST, (5) 


where 


Q 



'dR' 

er m ' 


The advantage is that (5) is independent of Sw, with 
the result that the gradient of I with respect to an 
arbitrary number of design variables can be deter- 
mined without the need for additional flow-field eval- 
uations. In the case that (2) is a partial differential 
equation, the adjoint equation (4) is also a partial 
differential equation and determination of the appro- 
priate boundary conditions requires careful mathe- 
matical treatment. 


The computational cost of a single design cycle is 
roughly equivalent to the cost of two flow solutions 
since the the adjoint problem has similar complexity. 
When the number of design variables becomes large, 
the computational efficiency of the control theory 
approach over traditional finite differencing strate- 
gies, which require direct evaluation of the gradients 
by individually varying each design variable and re- 
computing the flow field, becomes compelling. 

Once equation (5) is established, an improvement 
can be made with a shape change in the direction of 
the negative gradient 


st = -a g 


where A is positive, and small enough that the first 
variation is an accurate estimate of SI. Then 

SI = -A Q t Q < 0. 

After making such a modification, the gradient can 
be recalculated and the process repeated to follow a 
path of descent until a minimum is reached. Varia- 
tions on the optimization procedure which allow for 
the treatment of structural and aerodynamic con- 
straints can be readily incorporated in this approach. 


is applied to each subdomain. Here F is the flux 
function which can include the viscous fluxes in the 
case of the Navier-Stokes equations, w is the vec- 
tor of flow variables, and dS is the directed sur- 
face element of the boundary B of the domain V. 
The use of the integral form has the advantage that 
no assumption of the differentiability of the solu- 
tion is implied, with the result that it remains a 
valid statement for a subdomain containing shock 
waves. In general the subdomains could be arbi- 
trary, but in this work we use the hexahedral cells 
of a multiblock body-conforming curvilinear mesh. 
Discretizations of this type reduce to central dif- 
ferences on a regular Cartesian grid, and in order 
to eliminate possible odd-even decoupling modes al- 
lowed by the discretization, some form of artificial 
dissipation must be added. Moreover, when shock 
waves are present, it is necessary to upwind the dis- 
cretization to provide a non-oscillatory capture of 
discontinuities. The current version of the multi- 
block flow solver accomplishes this task using either 
a switched scalar dissipation scheme or the more 
sophisticated Convective Upstream Split Pressure 
(CUSP) approach, coupled with an Essential Local 
Extremum Diminishing (ELED) formulation. De- 
tails on these techniques and an extensive validation 
of the scheme for both inviscid and viscous flow, can 
be found in [26, 27, 54]. 

In order to apply the finite volume technique to the 
solution of flows around complex configurations, we 
have chosen to implement a multiblock strategy. In a 
multiblock environment, a series of structured blocks 
of varying sizes is constructed such that these blocks 
fill the complete space and conform to the surface 
of the geometry of interest. This segmentation of 
the complete domain into smaller blocks avoids the 
topological problems present in constructing a grid 
around complex configurations and multiply con- 
nected regions. The general strategy in the solution 
procedure of the multiblock flow solver is to con- 
struct a halo of cells which surrounds each block and 
contains information from cells in the neighboring 
blocks. This halo of cells, when updated at appro- 
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priate times during the numerical procedure, allows 
the flow solution inside each block to proceed inde- 
pendently of the others. 

This approach requires establishing the number and 
location of halo cells adjacent to block boundaries 
and constructing lists of halo cells and their inter- 
nal counterparts in the global mesh. In our case, 
we have chosen to carry out these setup procedures 
as part of a pre-processing module. During the 
pre-processing step, a two-level halo is constructed 
around each block. The requirement of this double 
halo results from the necessity to calculate all the 
necessary fluxes for the internal cells of each block 
without reference to additional cell locations outside 
the block in question. In particular, the second dif- 
ferences used for the third order dissipation terms 
require the values of the flow variables in the two 
neighboring cells on all sides of the cell in question. 
As we will see later, this approach is at the heart of 
the parallel implementation of the method. 


where 

Q = JK~\ 

The Euler equations can now be written as 


dW dFj 
dt + d& 


= 0 


in D, 


(10) 


with 


\ 

P 


pH 

pui 


pUiUi + Qnp 

PU 2 

> , Fi — Qijfj — < 

pU{U2 + Qi2P 

puz 


pUiUz + Qap 

. » E . 




For the multiblock flow solver, the above notation 
applies to each block in turn. The flow is thus deter- 
mined as the steady-state solution to equation (10) 
in all blocks, subject to the flow tangency or no-slip 
conditions on solid boundary faces: 


The system of equations solved as well as the solu- 
tion strategy follows that presented in many earlier 
works [34, 22, 21]. The governing equations of the 
flow may be written as 


dw dfi 
dt dxi 


0 in D , 


(6) 


where it is convenient to denote the Cartesian coor- 
dinates and velocity components by aq, x 2 , x 3 and 
ui, 112 , 163, and w and /* are defined as 


P 


f \ 

pm 

pui 


puiUi + pSn 

< pu 2 

> , fi = < 

pUiU2 + pSi2 

PU'A 


pUiU 3 + ps i3 

K pE j 


k ptliH ^ 


with Sij being the Kronecker delta function. Note 
that this definition of the flux functions fi corre- 
sponds to the Euler equations. They can be ex- 
panded to include the appropriate viscous terms 
in the Reynolds Averaged Navier-Stokes equations 
without modification to this discussion. Details of 
the construction of these viscous fluxes are presented 
in the following section. Also, 


p=( 7 -l)p (8) 


and 

pH = pE + p (9) 

where 7 is the ratio of the specific heats. Consider 
a transformation to coordinates £1, £ 2? £3 where 


Kij — 


^j, J ~ det {K ) , K-/ = 


d£i 


dx 


3 J 


Introduce scaled contravariant velocity components 
as 

Hi — £^ijUj 


U v — 0 on all Bs (12) 

for flow tangency, or 

=0, i = 1,2,3 on all Bs (13) 

for no-slip boundary conditions. In this notation, r / 
is 1, 2, or 3 depending on the direction that is nor- 
mal to face Bs where a solid surface is indicated. At 
the far field boundary faces, Bp, freestream condi- 
tions are specified for incoming waves, while outgo- 
ing waves are determined by the solution. 

The time integration scheme follows that used in the 
single block solver [34]. The solution proceeds by 
performing the cell flux balance, updating the flow 
variables, and smoothing the residuals at each stage 
of the time-stepping scheme and at each level of the 
multigrid cycle. The main difference in the inte- 
gration strategy is the need to loop over all blocks 
during each stage of the process. The use of the 
double-halo configuration permits standard single- 
block subroutines to be used, without modification, 
for the computation of the flow field within each indi- 
vidual block. This includes the single-block subrou- 
tines for convective and dissipative flux discretiza- 
tion, viscous discretization, multistage time step- 
ping, and multigrid convergence acceleration. 

The only difference between the integration strate- 
gies is in the implementation of the residual aver- 
aging technique. In the single-block solution strat- 
egy, tridiagonal systems of equations are set up and 
solved using flow information from the entire grid. 
Thus, each residual is replaced by a weighted av- 
erage of itself and the residuals of its neighbors in 
the entire grid. In the multiblock strategy, the sup- 
port for the residual smoothing is reduced to the ex- 
tent of each block, in order to eliminate the need to 


4 



solve scalar tridiagonal systems spanning the blocks, 
which would incur a penalty in communication costs. 
Depending on the topology of the overall mesh, the 
setup of tridiagonal systems that follow coordinate 
lines may lose the physical interpretation that it had 
in the single block implementation. This change has 
no effect on the final converged solution, and in all 
applications of the solver has not led to any reduc- 
tion in the rate of convergence. 


Viscous discretization 


In order to include the viscous terms of the Navier- 
Stokes equations into the spatial discretization 
scheme it is necessary to approximate the velocity 
derivatives which constitute the stress tensor 
t Tij . These derivatives may be evaluated by apply- 
ing the Gauss formula to a control volume V with 
boundary S : 

[ dV = [ UinjdS , 

Jv ox j Js 

where rij is the outward normal. For a hexahedral 
cell this gives 


duj 

dxj 


= 1 y 

v Zs 


faces 


Wf Ttj $ ) 


(14) 


where u* is an estimate of the average of ui over the 
face, rij is the jth component of the normal, and 5 
is the face area. 


In a cell centered scheme, the integration is carried 
out on a dual mesh obtained by connecting the cen- 
ters of the computational cells in the original mesh. 
This process yields an approximation of the stress 
tensor at the vertices of the original computational 
mesh. Once the stress tensor is computed at the 
cell vertices, it is averaged at the face centers before 
computing the viscous flux balance. This discretiza- 
tion is very efficient because it does not require the 
evaluation of the gradients separately for each cell 
face. However, as a consequence of the averaging 
process, the discretization may admit odd/even de- 
coupling modes. These modes should be damped by 
the third-order artificial dissipation already added to 
damp the odd/even modes arising from the central 
difference approximation of the convective terms. 
Alternatively, it is possible to add a correction sten- 
cil to the velocity gradients calculated at the ver- 
tices to approximately convert it to a more compact 
stencil characteristic of a face-centered approach [31] 
without increasing the computational cost. The first 
approached described is used here. 

The implementation of this discretization procedure 
for the viscous terms in the multiblock method re- 
quires only a single halo for the both the flow values 
and the grid locations. 


MULTIBLOCK DESIGN STRATEGY 

With the discussion of the multiblock flow solver 
completed, we will now describe the adjoint based 
design methodology. The development and imple- 
mentation of adjoint approaches for aerodynamic 
shape optimization has reached a stage of maturity 
in which problems of practical interest are starting 
to be considered. In one of our recent publications, 
both transonic and supersonic shape optimizations 
were performed for complex aircraft configurations 
subject to a variety of geometric constraints [49]. Al- 
though the inviscid Euler equations were used as the 
core CFD algorithm for these design calculations, 
an accompanying paper at the same conference pre- 
sented the derivation of a viscous adjoint algorithm 
and demonstrated a preliminary wing-body design 
capability using a viscous flow solver coupled with 
an inviscid adjoint solver [32]. The methodology pre- 
sented here will follow this latter approach to allow 
for the capability of complete aircraft shape design 
in the presence of viscous effects. 

The course of action can be described sis follows: 
first, the structured multiblock Navier- Stokes flow 
solver described in the previous section replaces the 
inviscid multibiock method used in reference [49]. 
Therefore, although the expression for the cost func- 
tion does not include viscosity related items, it re- 
flects the effects of the presence of the boundary 
layer. The ability to perform inviscid design cal- 
culations is retained with a simple input flag. The 
gradient of the cost function with respect to geomet- 
ric design variables is then calculated via the solution 
of the inviscid adjoint system of [49]. 

The lack of a viscous adjoint solver corresponding 
to the viscous flow solver has various important im- 
plications. First and foremost is the fact that the 
use of an Euler adjoint is mathematically inconsis- 
tent with a cost function evaluated via a set of vis- 
cous governing equations. Therefore, it is impossible 
to obtain gradients from this approach that match 
those obtained using finite differencing. However, 
for problems of engineering interest, the objective is 
not necessarily to find the true optimum at all cost, 
but to get within a reasonable vicinity of the mini- 
mum for a cost that is acceptable. 

It is interesting to note that other design approaches 
also suffer from a similar inconsistency for reasons 
of engineering interest. Quasi-inverse design meth- 
ods such as those used by Campbell [10, 11] as- 
sume a relationship between the pressure distribu- 
tion and the local surface curvature. This relation- 
ship effectively provides an inconsistent gradient in 
order to obtain improved designs. In reference [11] 
the idea has been pursued in applications using the 
Navier-Stokes equations. These approaches, which 
may be applicable for a small sub-class of problems, 
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are likely to fail in situations where the heuristic as- 
sumptions used to obtain gradient information cease 
to be valid. 

With these ideas in mind, it is important to con- 
sider both the advantages and the limitations of the 
present design technique. If the aerodynamic figure 
of merit to be minimized has a direct dependence on 
viscosity such as through the friction drag, the ap- 
proach is rendered invalid since the inviscid adjoint 
system lacks direct sensitivity to viscosity. However, 
for problems in which viscosity plays an indirect role 
the proposed design technique is bound to produce 
useful results. Some important aerodynamic shape 
optimization problems fall into this latter category. 
Take for example the problem of pressure drag min- 
imization for commercial transport aircraft. With- 
out breakthroughs in either laminar flow control or 
turbulent skin friction reduction technologies, most 
of the aerodynamic performance improvements at- 
tainable for a given configuration can be achieved 
through pressure drag minimization (both induced 
drag and wave drag). In addition, since the pressure 
gradient normal to a viscous boundary layer for air- 
craft at cruise conditions is negligible, the pursuit of 
inviscid methods for aerodynamic shape optimiza- 
tion has yielded moderate success [15, 50, 49]. 

However, inviscid design methods must be used cau- 
tiously even for inverse pressure distribution or pres- 
sure drag minimization problems, since the viscous 
effects will indirectly alter these quantities. The 
most noticeable effect is due to the boundary layer 
displacement thickness. The magnitude and impor- 
tance of the effective changes in wing shape caused 
by the presence of the boundary layer depend on the 
flow field in question, and generally become more 
pronounced under transonic conditions. The po- 
sition and strength of shock waves as well as the 
level of pressure recovery at the trailing edge can be 
strongly impacted by the existence of a boundary 
layer. In transonic flow, it is thus highly desirable 
to take viscous effects into account when designing 
the aerodynamic shape of a wing to minimize pres- 
sure drag. 

When the effect of the boundary layer on the outer 
flow couples very strongly, as is the case at transonic 
buffet or at maximum lift coefficient conditions, the 
ability to perform meaningful design without a vis- 
cous adjoint can be questioned. 

In summary, a design methodology has been devel- 
oped that uses the Navier-Stokes equations for the 
flow solution and an inviscid adjoint formulation to 
obtain gradient information. This method is suitable 
for a large class of problems of practical aerodynamic 
interest. For problems in which the viscous effects 
dominate the behavior of the flow, the viscous for- 
mulation of the adjoint equations more than likely 


will be necessary. It is our intention to pursue this 
issue further in the coming months. 


Adjoint Solver 

The mathematical development of the inviscid ad- 
joint equations used in this research has been exten- 
sively discussed in our earlier work [23, 24, 25, 29, 45, 
33, 46, 47, 48, 50]. An introductory treatment of the 
derivation of a viscous adjoint has been given in ref- 
erence [32]. In this section we present a short review 
of the development of the inviscid adjoint equations 
for the illustrative problem of pressure drag mini- 
mization subject to a variety of constraints. 



C D 


Ca cos a + Cn sin a 


1 

S ref 



cos a -f S y sin a) d £ i d£ 2 , 


where S x and S y define projected surface areas, S re f 
is the reference area, and d£ 1 and d £ 2 are the two 
coordinate indices that are in the plane of the face 
in question. Note that the integral in the final ex- 
pression above is carried out over all solid boundary 
faces. The design problem is now treated as a control 
problem where the control function is the geometry 
shape, which is chosen to minimize /, subject to the 
constraints defined by the flow equations. A vari- 
ation in the shape will cause a variation 6p in the 
pressure and consequently a variation in the cost 
function 

61 = SC D + ^ 6a 
da 

where 6Cd is the variation due to changes in the 
design parameters with a fixed. To treat the prob- 
lem of practical design, drag must be minimized at 
a fixed lift coefficient. Thus an additional constraint 
is given by 

SC L = 0, 

which yields 


6C l + ^ 6a = 0. 
da 

Combining these two expressions to eliminate 
gives 

(QQn.) _ 

61 = 6C d - j$Ht 6C l . (15) 

Since p depends on w through the equation of state, 
the variation 6p can be determined from the vari- 
ation 6w. If a fixed computational domain is used, 
the variations in the shape result in variations in the 
mapping derivatives. Define the Jacobian matrices 

Ai = §^, Ci = QijAj. (16) 
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Then the equation for Sw in the steady state be- 
comes 

wP SFi) = 0 ’ (17) 

where in the domain 


8Fi = C{8w + 6 ( Qij ) fj, 
and on the solid surface, 


5F V = < 


' 0 ' 


f 0 ' 

Q v iSp 


*(Qn i) 

< Qn-zSp 

► +P < 

HQm) 

QvsSp 


HQn*) 

> 0 , 


> 0 , 


> on any Bs- 


( 18 ) 


Now, multiplying equation (17) by a vector co-state 
variable ip, assuming the result is differentiable, and 
integrating by parts over the entire domain, 

Id d ^ ~ L ^ = °’ (19) 


where hi are components of a unit vector normal to 
the boundary. Equation (19) can now be subtracted 
from equation (15) without changing the value of 
81. Then ip may be chosen to cancel the explicit 
terms in 8w and 8p. For this purpose ip is set to the 
steady-state solution of the adjoint equation 


dip 

~di 


-Cfg=° in D, 


( 20 ) 


with the surface boundary condition 


(faQni + fa Qrj 2 + iPaQvs) = Q on all B s , (21) 
where 

Q = I M l c {(5a; cos a + S y sin a) 

27 iW oo s ref 

+Vt (S y cos a — 5a; sin a)} . 


At internal block boundaries, the face integrals can- 
cel from the contributions of the adjacent blocks. At 
the far field the choice of the adjoint boundary con- 
ditions depends on whether the flow is subsonic or 
supersonic. For subsonic flow, so long as the outer 
domain is very far from the configuration of interest, 
we may set 


^01—5 =0 on all Bp- 

It is noted that the waves in the adjoint problem 
propagate in the opposite direction to those in the 
flow problem because of the transpose in equation 
( 20 ). 


Finally we obtain the expression 
SI = 

s ref J J j 

+0 ( SS y cos a — SS X sin a)} d£id& 

J 1> T p£ i (SQiifj)dt k . ( 22 ) 


^ — / / Cp { (8S X cos a + SS y sin a) 
°ref J Jb s 


+ 


In order to evaluate the changes in the cost from 
the above expression, the function ip must be de- 
fined through the solution of (20). A major dif- 
ference between the development of adjoint solvers 
presented by our group and those cited in refer- 
ences [4, 5, 6, 8, 7, 13, 14, 9, 38, 36, 20, 41, 37, 35] is 
that we have relied on a continuous formulation. In 
this formulation the adjoint system of equations is 
derived starting from the continuous governing equa- 
tions to produce a set of continuous co-state equa- 
tions including boundary conditions. This set of co- 
state equations is then discretized for computational 
analysis as a final step. A discrete formulation in- 
terchanges the order of these operations by starting 
from the discrete governing equations and employing 
linear algebra to obtain a discrete adjoint system. It 
is useful to note that the final results from these two 
approaches can be explained as alternate discretiza- 
tions of the continuous adjoint formulation. 

This subtle difference in the order of the adjoint and 
discretization operations has several important im- 
plications. Some of these differences have been ex- 
plored in detail in the first author’s Ph.D. disser- 
tation [52] as well as in the recent work by Ander- 
son and Venkatakrishnan [3]. For purposes of the 
present work, it is important to focus on one partic- 
ular difference between the continuous and discrete 
adjoints. If a discrete approach is followed, gradi- 
ents obtained via either the resulting adjoint or di- 
rectly through finite differences should converge to 
the same result. Hence, the very idea of using an 
inviscid adjoint for a viscous state equation would 
not exist. The combination of methods used in this 
paper is derived from the natural flexibility of em- 
ploying a continuous adjoint formulation. 

Details of the particular discretization used here are 
covered in reference [52]. The discrete adjoint sys- 
tem is solved in precisely the same manner as the 
flow equations. More details of the approach as well 
as the development for other cost functions have 
been presented in references [25, 29, 33, 46, 47, 50, 
43, 49]. 


Design Variables and Underlying 
Geometry Database 
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Even with the rapid developments of the last few 
years regarding the derivation and implementation 



of adjoint solvers, many unresolved issues require 
further research efforts. Not the least of these re- 
maining difficulties is the precise description of the 
machinery used to modify the shape of interest. This 
choice directly affects other aspects of the design al- 
gorithm. Observing that equation (22) requires not 
only the flow and the adjoint solutions, but also vari- 
ations in the mesh metrics, we see the importance of 
choosing the design variable formulation. In order 
to obtain all the discrete gradient components from 
equation (22) it will be necessary either to develop 
an analytic expression for the variation in mesh met- 
rics or to calculate them directly. The availability of 
this choice will be determined by the choice of the 
design variable formulation. 

Available choices for the design variables span a wide 
spectrum ranging from employing the locations of 
the actual mesh points, to relying on the analytic 
control points used in a CAD definition of the ge- 
ometry. 

In the case of using the actual mesh points, no un- 
derlying geometry database exists. Constraints, if 
present, must be imposed directly on the locations of 
these mesh points. This approach will surely prove 
problematic in general. Consider, for example the 
difficulties involved in the imposition of a wing fuel 
volume constraint. In addition, the treatment of 
surface intersections (such as the wing-body) raises 
difficulties for this approach since the path for the 
motion of the mesh points lying directly on these 
intersections is ill-defined. 

However, an advantage of using the mesh points as 
design variables is that, when combined with an an- 
alytic mesh mapping transformation, the calculation 
of the gradient can be performed without explicitly 
computing the variations in the mesh metrics. Un- 
fortunately, obtaining such a general mapping trans- 
formation increases in difficulty with added geomet- 
ric complexity. 

The alternative of using an underlying geometry 
database, which may be modified either by the di- 
rect application of design variables or by changes 
in the coefficients of its possibly analytic defini- 
tion, also has its advantages. First, since the raw 
unintersected geometries are available, constraints 
and design changes affecting intersections are eas- 
ily treated. This can be done without regard to the 
actual mesh that is used for the flow and adjoint 
calculations. However, these strengths are counter- 
balanced by the fact that additional computational 
work is required to calculate the mesh metric varia- 
tions. 

In the current research, we have used an underlying 
geometry database where a set of simple geometric 
entities, such as wings and bodies, are input to the 


design algorithm in addition to the multiblock mesh 
used for the calculations. Design variables which axe 
defined as a set of analytic shape functions are ap- 
plied directly to these geometric entities. Linear and 
nonlinear geometric constraints are then evaluated 
on these primary entities. At any particular point 
in the design process, changes to the mesh surfaces 
are obtained by first intersecting all of the geomet- 
ric entities to construct a set of parametric surfaces 
representing the complete configuration. The loca- 
tion of each surface mesh point on this parametric 
representation of the geometry is determined for the 
initial configuration in a pre-processing step. Thus, 
the results of this pre-processed mapping from para- 
metric geometry to the computational surface mesh 
points is also a part of the necessary input. The 
perturbed surface mesh point locations are deter- 
mined by evaluating the parametric geometry sur- 
faces at these predetermined locations. Once the 
surface mesh points have been updated, the volume 
mesh may be perturbed (see following section on 
mesh motion) and either the gradient or the solution 
can be calculated. The important feature of this ap- 
proach is that a set of simple geometric entities lies 
at the core of the entire design process. This tech- 
nique retains the typical way in which aerodynamic 
vehicles are defined, and provides strict control over 
how surface intersections are treated. Furthermore, 
since the chosen design variables act directly on the 
geometric entities, at the end of the design process 
these entities may be output for future analysis. 

In the current implementation, input geometric en- 
tities are restricted to those defined by sets of 
points. However, in the future, CAD entities such 
as NURBS surfaces will also serve this role, thereby 
allowing both the input and the output from the 
aerodynamic surface optimization method to inter- 
face directly with a CAD database. 


Mesh Perturbation Algorithm 

After we have applied a set of design variables to the 
underlying geometry and mapped these changes to 
changes in the computational surface mesh points, 
two related tasks remain. For gradient calculations, 
variations in the mesh metrics must be calculated. 
In addition, when a design step is to be taken it must 
be possible to deform the entire mesh to accommo- 
date design changes. Both tasks are accomplished 
in this work by the approach presented in references 
[49] and are only outlined here. 

Since it would be difficult in the current applica- 
tion to obtain an explicit relationship between ar- 
bitrary surface changes and variations in the multi- 
block mesh metrics, these latter quantities are cal- 
culated by finite differences. This approach avoids 
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the use of multiple flow solutions to determine the 
gradient, but it unfortunately still requires the mesh 
to be regenerated repeatedly. The number of mesh 
generations required is proportional to the num- 
ber of design variables. The inherent difficulty in 
the approach is two-fold. First, for complicated 
three-dimensional configurations, elliptic or hyper- 
bolic partial differential equations must normally 
be solved iteratively in order to obtain acceptably 
smooth meshes. These iterative mesh generation 
procedures are usually computationally expensive. 
In the worst case they approach the cost of the flow 
solution process. Thus the use of finite difference 
methods for obtaining metric variations in combina- 
tion with an iterative mesh generator leads to com- 
putational costs which strongly hinge on the num- 
ber of design variables, despite the use of an adjoint 
solver to eliminate the flow variable variations. Sec- 
ond, multiblock mesh generation is by no means a 
trivial task. In fact no method currently exists that 
allows this to be accomplished as a completely au- 
tomatic process for complex three-dimensional con- 
figurations. 

Here, these difficulties are overcome through the 
use of a mesh perturbation technique. In this ap- 
proach, a high quality mesh appropriate for the flow 
solver is first generated by any available procedure 
prior to the start of the design. In examples to be 
shown later, these meshes were created using the 
Gridgen software developed by Pointwise, Inc. [53]. 
This initial mesh becomes the basis for all subse- 
quent meshes which are obtained by analytic per- 
turbations. 

In order to perturb the multiblock mesh, two ca- 
pabilities are required. First, the block corners, 
edges and faces must be moved in a manner that 
follows the desired geometric changes and simulta- 
neously retains mesh continuity throughout the do- 
main. The second requirement is to move all the 
points interior to each block such that the spac- 
ing distributions and smoothness of the original 
mesh are retained. This latter requirement is ac- 
complished by the WARP3D algorithm [43]. Since 
our current flow solver and design algorithm assume 
a point-to-point match between blocks, each block 
may be independently perturbed by WARP 3D, pro- 
vided that perturbed surfaces are treated continu- 
ously across block boundaries. The methodology 
used to achieve the first requirement of maintain- 
ing continuity in the blocking structure is given as 
follows: 

1. All faces that are directly affected by the design 
variables (active faces) are explicitly perturbed. 

2. All edges that touch an active face, either in 
the same block or in an adjacent block, are im- 


plicitly perturbed by a simple arc-length-based 
algorithm. 

3. All inactive faces that either include an implic- 
itly perturbed edge or abut to an active face 
are implicitly perturbed by a quasi-3D form of 
WARP 3D. 

4. WARP3D is used on each block that has one or 
more explicitly or implicitly perturbed faces to 
determine the adjusted interior points. 

Note that much of the mesh, especially away from 
the surfaces, will not require mesh perturbations and 
thus may remain fixed through the entire design pro- 
cess. Close to the surfaces, many blocks will either 
contain an active face or touch a block which con- 
tains an active face, either by an edge or by a corner. 
As the design variations affect the active faces, the 
above scheme ensures that the entire mesh will re- 
main attached along block boundaries. Added com- 
plexity is needed to accomplish step (2) since the 
connectivity of the various edges and corners must 
be indicated somehow. Currently, pointers to and 
from a set of master edges and master corners are 
determined as a pre-processing step. During the de- 
sign calculation, perturbations to any edges or cor- 
ners are fed to these master edges and master cor- 
ners which in turn communicate these changes to all 
connected edges and corners. 

Since this mesh perturbation algorithm is explicit 
it is possible to work out the analytic variations in 
the metric terms required for equation (22). This 
approach was followed in [46]. However since the 
mesh perturbation algorithm that is used in the cur- 
rent paper was significantly more complex, and it 
was discovered that the computational cost of re- 
peatedly using the block perturbation algorithm was 
within reason, finite differences were used to calcu- 
late SQij instead of deriving the exact analytical re- 
lationships. 

Optimization Algorithm and 
Problem Constraints 

With all of the machinery to obtain gradients for an 
arbitrary set of design variables in place, it remains 
as a final detail to outline the numerical optimization 
algorithm and the imposition of constraints. The 
NPSOL optimization algorithm employed here was 
chosen because of its extensive past use on aero- 
dynamic optimization problems, and its treatment 
of both linear and nonlinear inequality constraints. 
NPSOL [16] is a sequential quadratic programming 
(SQP) method in which the search direction is calcu- 
lated by solving the quadratic subproblem where the 
Hessian is defined by a quasi-Newton approximation 
of an augmented Lagrangian merit function. The 
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Lagrange multipliers in this merit function serve to 
scale the effect of any nonlinear constraints that the 
design may contain. Linear constraints are treated 
by solving the quadratic subproblem such that the 
search direction remains in feasible space. A com- 
plete treatment of the method and other optimiza- 
tion strategies is given in [17]. 

The entire design procedure is outlined below: 

1. Decompose the multiblock mesh into an appro- 
priate number of processors, and create lists of 
pointers for the communication of the processor 
halo cells. 

2. Solve the flow field governing equations (6-11) 
for each design point. 

3. Solve the adjoint equations (20) subject to the 
boundary condition (21) for each design point. 

4. For each of the n design variables repeat the 
following: 

• Perturb the design variable by a finite step 
to modify the geometric entities. 

• Reintersect the geometric entities and 
form parametric geometry surfaces. 

• Explicitly perturb all face mesh points af- 
fected by the geometry changes by evalu- 
ating their locations on the parametric ge- 
ometries. 

• Implicitly perturb all faces that share an 
edge with an explicitly perturbed face. 

• Obtain the perturbed internal mesh point 
locations via WARP3D for those blocks 
with perturbed faces. 

• Calculate all the delta metric terms, 6Qij, 
within those blocks that were perturbed 
using finite differencing. 

• Integrate equation (22) to obtain 61 for 
those blocks that contain nonzero SQij, 
and for each design variable, to determine 
the gradient component. 

5. Calculate the search direction and perform a 
line search via NPSOL. 

6. Return to (2) if a minimum has not been 
reached. 

PARALLELIZATION STRATEGY 

Communication Management 

Efficient parallel computation on a set of distributed 
processors is achieved by a combination of minimiz- 
ing the overhead of communication between these 


processors and balancing the partitioned workload 
among them. The first obvious choice in order to im- 
prove parallel performance is then to minimize the 
amount of communication required between proces- 
sors. This includes minimizing the number of mes- 
sages sent and received (latency) as well as the total 
amount of data to be transferred (bandwidth) . Fur- 
ther improvement may be achieved by using other 
techniques such as latency hiding and scheduled 
communication, if either the problem or the specific 
architecture of the distributed platform allow it. 

Latency hiding is in itself a form of local paral- 
lelism, where the communication and computations 
of an individual node proceed concurrently with each 
other. This benefit is accomplished by initiating the 
bi-directional asynchronous communication as soon 
as the data to be passed has been updated and in 
such a manner that calculations that can be done 
(without updated data from remote nodes) are per- 
formed while the communication continues in the 
background. 

In the case of networks which utilize a collision- 
detection based protocol (e.g., ethernet), scheduled 
information transfer may help reduce the commu- 
nication overhead. A consequence of the collision- 
detection mechanism is that the effective bandwidth 
of a saturated network is degraded since a portion of 
it is wasted when two or more processors are trying 
to initiate communication simultaneously. Hence, by 
scheduling or synchronizing the messages between 
processors, one can minimize this deterioration in 
performance. On the other hand, the current gener- 
ation of parallel computers typically include a net- 
working environment which is capable of sustaining 
simultaneous communication among all the proces- 
sors in the machine. This enhanced communication 
ability comes with a high price tag. An intermedi- 
ate solution where hardware switching is embedded 
in a workstation distributed computing environment 
will be shown to be a compromise between these two 
options. 

In this paper, we revisit the task of minimizing the 
amount of total communication required between 
processors. However, we avoid the temptation to 
communicate less often than is consistent with the 
baseline serial calculation so that the convergence of 
the original scheme is exactly maintained. The pos- 
sibility of improving the parallel performance of the 
method by further restricting the amount of commu- 
nication at different points in the multigrid sequence 
will be investigated at a later time. 

The following subsections describe the baseline 
method against which all improvements are mea- 
sured, and the modifications (associated with com- 
munication overheads) developed under the present 
work. 
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Baseline Communication 

The multiblock solver is parallelized by using a do- 
main decomposition model, a SPMD (Single Pro- 
gram Multiple Data) strategy, and the MPI Library 
for message passing. 

The baseline parallel scheme is exactly consistent 
with the serial multiblock solution: the results pro- 
duced by both programs are identical, including the 
convergence history of the method. Updates to the 
solution vector in all processors occur at every stage 
of the Runge-Kutta time-stepping scheme and in 
every level of a multigrid W-cycle. In addition, 
the baseline computations are performed with 64-bit 
arithmetic. Therefore, flow residuals in the calcula- 
tion can be converged approximately 13 orders-of- 
magnitude before roundoff effects stall the conver- 
gence process. 

The multiblock strategy adopted in this work allows 
the independent update of the internal cells of ev- 
ery block in the mesh by using a halo or ghost cell 
approach. The information in this halo of cells sur- 
rounding each block is transferred from the corre- 
sponding physical cells in the interior of the neigh- 
boring blocks. The baseline scheme utilizes a three- 
pass communication model which allows for the com- 
putation of solutions on arbitrarily oriented multi- 
block meshes. 

Under this model, updated halo information is trans- 
ferred across the six faces of each block during each 
phase of the three-pass communication. The first 
pass transfers face information, the second pass pro- 
vides edge data, and the final pass is required to 
update the solution across the block corners. With 
this three-pass approach, each block is guaranteed to 
have the proper information in its complete halo (in- 
cluding edges and corners) regardless of the topology 
of the mesh. This is a particularly challenging sit- 
uation when more than four blocks meet at a given 
edge. The double halo is used to compute the third- 
order artificial dissipation terms while preserving a 
fully conservative scheme. Although the current dis- 
cretization of the viscous fluxes requires only a single 
level halo, future variations which require the pres- 
ence of a double halo can be accommodated with 
this procedure. 

In addition to the above, the blocks of the baseline 
solution are distributed to the individual processors 
in such a manner that the total number of unknowns 
per processor is as evenly loaded as possible. While 
finding the optimum distribution is recognized to be 
an NP- Complete problem, a simple algorithm is em- 
ployed which routinely yields a load balancing in 
the neighborhood of the optimum. The essence of 
this algorithm is to take the largest of the remaining 
blocks (yet to be distributed) and assign it to the 


Processor 

Ncells 

Cell-Ratio 

1 

185,088 

1.007 

2 

185, 088 

1.007 

3 

182,400 

0.993 

4 

182,400 

0.993 


Table 1: Baseline Load-Balance for the Benchmark 
Test Case on 4-Processors. 

processor with the smallest current load. Repeating 
this procedure until ail blocks have been distributed, 
an effective load balance algorithm is obtained. Ta- 
ble 1 illustrates the effectiveness of this algorithm 
for our benchmarking test case. 

The test case of Table 1 is used to benchmark the 
effectiveness of the enhancements described below. 
It is a 72-block grid about a wing-fuselage-nacelle 
geometry. The total number of cells in the system 
is 734, 976. Of these, roughly 300, 000 are halo cells. 
This case was chosen specifically because it accentu- 
ates the penalty of communication. Yet, on a high- 
speed, low-latency network such as those on the IBM 
SP2 or the SGI Origin2000, the corresponding flow 
solutions scale reasonably with the number of pro- 
cessors. 


One-Pass Communication Model 

The use of the original three-pass communication 
model was necessary for handling a completely gen- 
eral block structure. Drawbacks of this approach 
are that redundant communications are performed 
and that the second and third passes must wait un- 
til the previous passes have completed before they 
are started. 

The source of redundant data passing can be seen 
by following the flow of information from one block 
to a neighboring block coincident with an edge or a 
corner of the originating block. For example, across 
an edge, information from one block to another (lo- 
cated above and to the right) can flow in one of two 
ways. Firstly, the data could flow from the origi- 
nating block to its right-hand neighbor, then this 
information could be transferred from this neigh- 
bor to the block directly above it. Alternatively, 
the data could first move upward, then to the right. 
Because of the complexity involved in determining 
which path the data should flow along and which 
it should not, the baseline three-pass model trans- 
fers the information in both directions. Similarly, 
for communications across corners, this redundancy 
is three-fold. 

Hence, an obvious source of improvement is to re- 
move redundant data transfers from the communi- 
cations model. This is accomplished by adopting 
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Processor 

1 

2 

3 

4 

MPI 

Processor 

1 

2 

3 

4 

MPI 

1 

22,848 

30, 080 

27,040 

12, 960 

70, 080 

1 

17,504 

26, 480 

21,128 

13,744 

61,352 

2 

30, 080 

23, 296 

5, 376 

20, 224 

55,680 

2 

26, 496 

17, 696 

8,264 

16,264 

51,024 

3 

27, 040 

5,376 

21,248 

31,648 

64,064 

3 

21,232 

8,256 

16, 792 

27, 864 

57, 352 

4 

12, 960 

20, 224 

31,648 

12, 992 

64,832 

4 

13, 776 

16,232 

27, 768 

10,136 

57, 776 

MPI 

70, 080 

55,680 

64,064 

64,832 

254, 656 

MPI 

61,504 

50, 968 

57, 160 

57, 872 

227, 504 


Table 2: Three-Pass Communication Matrix using 
the Baseline Load-Balance. 

a single-pass scheme which reproduces exactly the 
end state of the original three-pass model. In or- 
der to ensure that the the one-pass communication 
model produces results identical to the three-pass 
approach, the original three-pass model is used to 
initialize the communication lists of the one-pass 
method. This is accomplished in the following man- 
ner: after the blocks of the grid system have been as- 
signed to an appropriate processor through the load 
balancing procedure, the solution vector is colored 
with information that describes its starting location 
prior to any communication. This encoding includes 
information such as block and processor numbers 
and local cell indices. With this state set, the so- 
lution vector is processed with the original three- 
pass communication model. Upon completion of this 
data transfer, every halo cell in the distributed sys- 
tem has been reset with information which points 
back to its origin, i.e. block number, processor num- 
ber and “distant” cell index. At this stage, new com- 
munication lists are constructed and returned to the 
source processor which stores them for future use by 
the one-pass model. 

Tables 2 and 3 illustrate the reduction in commu- 
nication achieved for the one-pass model. These 
tables provide the communication matrix (for the 
finest mesh in the multigrid sequence) of message 
sizes for each communication approach. The diago- 
nal terms of these matrices correspond to messages 
that a processor needs to send to itself. For this 
kind of message, the present method uses a local 
memory copy instead of an actual MPI (Message- 
Passing Interface) message, which is used for inter- 
processor communication. For the benchmark test 
case, the one-pass model reduces the total message 
length by about 11% on the fine mesh. However, 
because there is no forced synchronization between 
passes as in the three-pass model, the overhead re- 
duction approaches 25%. 


Delta Updates 

In the baseline code, communication always trans- 
ferred the actual values of the solution vector. In 
order to preserve 64-bit accuracy, all of these values 
were transferred as 64-bit floating-point numbers. In 
the present one-pass model, an additional choice of 


Table 3: One-Pass Communication Matrix using the 
Baseline Load-Balance. 

communication model has been implemented. We 
refer to this communication model as the delta up- 
date procedure. 

The purpose of including a delta form in the present 
work is motivated by the fact that these delta in- 
crements can be transferred as 32-bit numbers while 
maintaining 64-bit accuracy in the converged solu- 
tion. Naturally, maintaining this level of precision 
during the course of convergence requires an occa- 
sional reset of the halo values with a 64-bit commu- 
nication, although the large majority of the commu- 
nication is now performed using only single precision 
32-bit numbers. For this occasional reset, we have 
maintained the capability to transfer actual full pre- 
cision values of the solution vector. 

For all practical purposes, the communication over- 
head of the new delta form is half that of the baseline 
(full precision) transfers. 


Communication- Weighted Load Balancing 

As mentioned above, the original load-balancing al- 
gorithm was guided solely by the number of cells 
being distributed to the complete set of processors. 
This form of load balancing has proved to be quite 
acceptable for platforms with state-of-the-art com- 
munication capabilities such as the IBM SP2. How- 
ever, for a cluster of workstations linked together 
with a lower performance network, this technique 
can be further refined. 

A new load-balancing algorithm has been developed 
which includes the penalties associated with out-of- 
processor communication. In this setting, the load 
is defined as the time it takes each processor to com- 
plete all of its tasks-numerical processing as well as 
sending and receiving the necessary messages. The 
predicted times of each of these tasks are derived us- 
ing experimentally obtained MFLOPS (Millions of 
Floating Point Operations per Second) ratings, and 
the MPI latency and bandwidth values associated 
with the particular distributed platform. 

The new load-balancing algorithm is very similar to 
that of the original method, but the “size” of each 
block is now initialized assuming that the informa- 
tion of all halo cells will be transferred to another 
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LATENCY 

BANDWIDTH 

PLATFORM 

MFLOPS 

(M-sec) 

(MBytes/s) 

SP2 Switch US 

50 

43 

35.0 

SP2 Switch IP 

50 

285 

13.0 

HP/J280-100BaseT 

30 

290 

7.0 

HP / J 280-ethernet 

30 

600 

0.8 


Table 4: Observed Capacities of Various Platforms. 


processor. The algorithm then proceeds by taking 
the largest of the remaining blocks (yet to be dis- 
tributed) and temporarily assigning it to every pro- 
cessor. When assigned to each processor, a tempo- 
rary update of the load of that processor is made by 
adding the size of the current block to that proces- 
sor’s previous load. This assignment is rewarded by 
a decrease in the equivalent size if neighboring blocks 
are already assigned to that processor and thus no 
communication is necessary. After all temporary as- 
signments have been done, the processor whose load 
is the smallest after the assignment is selected and 
the block is permanently assigned to that processor. 
The previous steps are repeated until all blocks have 
been distributed. 

Table 4 provides representative values for the IBM 
SP2 using the switch in User Space mode (high 
performance communication mode), for the switch 
and IP (Internet Protocol), and for an HP worksta- 
tion cluster of J280s using switched-lOOBaseT and 
standard ethernet. These values have been exper- 
imentally observed and may vary from site to site. 
They correspond to the measured values of latency 
and bandwidth using various implementations of the 
MPI standard on the different platforms mentioned 
above. They are not the manufacturer’s published 
data for the communication hardware. It is inter- 
esting to note that the values of latency and band- 
width obtained for the switched- 100BaseT network 
are quite close to those for the IBM SP2 system 
communicating in IP mode. Therefore, the SP2 can 
be used to simulate large networks of workstations 
linked together by a switched-lOOBaseT network. 

Using the network characteristics for the HP cluster 
on standard ethernet, a new distribution of blocks is 
obtained. This distribution is provided in Table 5. 
Comparing it with Table 1, it is noticed that the 
number of cells per processor is not nearly as well 
’’balanced” as before. Yet, the solution’s cycle time 
of the new distribution on the HP cluster using an 
ethernet network is only 57.46 seconds as compared 
with the original cycle time of 108.38 seconds. 

The secret of this performance improvement can be 
seen by comparing the communication matrices of 
the original load-balanced distribution with that of 
the improved one. This information is provided by 
Tables 3 and 6, respectively. Notice that the new 


Processor 

Ncells 

Cell-Ratio 

1 

169,536 

0.923 

2 

203, 136 

1.106 

3 

188,544 

1.026 

4 

173, 760 

0.946 


Table 5: New Load-Balance for the Benchmark Test 
Case on 4-Processors. 


Processor 

1 

2 

3 

4 

MPI 

1 

28, 848 

11,888 

16,736 

4,304 

32, 928 

2 

11,968 

42,168 

2,368 

16, 328 

30,664 

3 

16, 744 

2,552 

43, 840 

11,696 

30, 992 

4 

4,456 

16, 576 

11,648 

35,384 

32, 680 

MPI 

33, 168 

31,016 

30, 752 

32,328 

127, 264 


Table 6: One-Pass Communication Matrix using the 
New Load-Balance. 


load-balancing algorithm has done an effective job 
of reducing the amount of data to be transferred via 
MPI. Under the baseline distribution, MPI messages 
are used to set a total of 227,504 halo cells in the fine 
mesh. Using the new load-balancing algorithm, MPI 
calls are only responsible for resetting now 127,264 
halo cells. This is accomplished by increasing the 
amount of data transfer each processor does with 
itself (i.e., in a global sense, communication is drawn 
toward the diagonal of these matrices) . 

Single-Layer Halo Communication 

In the baseline method, we stated that a double- 
layer halo surrounds each block and it is utilized 
to facilitate calculation of the third-order artificial 
dissipation fluxes. However, upon close inspection 
of the 5-stage Runge-Kutta scheme and multigrid 
processes, we note that the dissipative fluxes are not 
recomputed as often as the solution updates occur. 
In particular, these dissipative terms are typically 
reset only during the odd stages of Runge-Kutta on 
the finest mesh and never computed in any of the 
coarser levels of the multigrid scheme. 

Immediately, we can omit transferring the outer- 
layer halo data during the even (of five) stages of 
Runge-Kutta in the fine mesh. This reduces the fine- 
mesh communication by 20%. 

For a 4-level multigrid W-cycle in the baseline code, 
more than 45% of the total data transferred during 
the cycle resides in the coarser-level meshes, tt By 
updating only the data of the inner halo during the 
coarse-level communication, an additional improve- 
ment is realized. 

ft A W-cycle with four levels of multigrid traverses the 
second-level grid exactly twice when communication is in- 
volved; four times in the third and fourth levels. The number 
of halo cells of a coarse-mesh is at least one-fourth that of 
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The above two improvements combine to reduce the 
total amount of data transferred per multigrid cy- 
cle. Relative to the baseline communication, this 
reduction in overhead is between 33.3% and 54.7%, 
depending on the granularity of the mesh involved. 

Communication Improvement Summary 

For the 72-block mesh in question, the relative im- 
provements in communication overhead with respect 
to the original scheme can be summarized as follows: 

• 20% reduction in overhead with one-pass 
(benchmark). 

• 50% reduction in overhead with delta form (in 
general). 

• 50% reduction in overhead with new load bal- 
ancer (benchmark, ethernet). 

• 33%-55% reduction in overhead with single-halo 
transfers (in general). 

• Communication reduced by a minimum of 75% 
when combined. 

RESULTS 

Design Results 

The design test cases to be presented here will focus 
on the wing redesign of a typical transonic business 
jet. The designs will be carried out independently 
using the Euler and Navier-Stokes equations. The 
discussion will conclude with comparisons between 
the final Euler and Navier-Stokes designs. For the 
Euler design case, reference [50] gives a treatment of 
the reliability of the flow solver as well as the abil- 
ity of the adjoint method to provide accurate gradi- 
ents very efficiently. With regard to the validity of 
the Navier-Stokes case, a comparison will be made 
for the initial configuration using both the inviscid 
and the viscous equations. The adjoint gradients 
for the Navier-Stokes test case will not be compared 
with finite difference calculations for two reasons. 
First, since the adjoint used to obtain the gradients 
is not of the viscous type, it is understood and ac- 
cepted that it will not produce gradients that are 
consistent with the finite difference approach. Sec- 
ondly, the computational cost of obtaining finite dif- 
ference gradients for the Navier-Stokes design on a 
large three-dimensional test case is prohibitive. In 
order to obtain accurate finite difference gradients, 

the next finer mesh. Hence, the baseline communication in 
the coarse grids is at least 81% as intense as it is in the finest 
mesh. Further study of a grid with only one interior cell in the 
fourth-level mesh shows that the baseline communication in 
the coarser grids can approach 183% that of the finest mesh. 


the flow solution must be converged at least two or 
three orders more than is necessary for adjoint gra- 
dients [52, 39]. Navier-Stokes solvers with their no- 
toriously slow convergence would take an unaccept- 
able number of iterations to achieve such a level of 
convergence. 


Flow Solver Comparison 

In the design demonstration of the multiblock op- 
timization algorithm to follow, a typical transonic 
business jet configuration is considered. The same 
geometry was also studied in [50, 15, 49]. Here 
the complete configuration including wing, body, na- 
celle, pylon, vertical tail, and horizontal tail will be 
used. Prior to the start of the designs, flow analyses 
were completed using the Euler and Navier-Stokes 
equations. 

Two alternative meshes were constructed that 
shared the same block topologies and differed only 
in the normal wall spacings and cell counts. The 
meshes were both created with a general C-0 topol- 
ogy and flow-through nacelles. Both meshes fea- 
tured 240 blocks, with the Euler mesh having 4.1 
million computational cells and the Navier-Stokes 
mesh having 5.8 million computational cells. The 
relative ratio between the two is smaller than ex- 
pected since the Euler mesh was constructed from 
the Navier-Stokes mesh simply by coarsening in 
the viscous direction. Furthermore, while complete 
configurations are being modeled, only the wing is 
treated as a viscous solid surface. The other compo- 
nents are handled with inviscid boundary conditions. 
A single flow calculation for the Euler solution using 
200 multigrid cycles converges 5 orders of magni- 
tude in 0.8 hours of wall clock time on 32 processors 
of an IBM SP2 machine. By comparison, a Navier- 
Stokes analysis uses 300 multigrid cycles to converge 
4.7 orders of magnitude and consumes 2.0 hours of 
wall clock time on 32 processors of an IBM SP2 ma- 
chine. An Euler and a Navier-Stokes solution are 
compared for the baseline configuration at the same 
over-design flight conditions and compared in Figure 
(1). The C p distributions depicted in the figure show 
the usual trend of having the shock strength and 
location being moved forward for the viscous anal- 
ysis when compared to the inviscid analysis. The 
Navier-Stokes calculation was carried out using an 
all-turbulent boundary layer with a Baldwin-Lomax 
turbulence model. The wall normal spacing of the 
first cell was such that at the cruise condition a 
y + — 1 would be attained at the half span trail- 
ing edge assuming a flat plate turbulent boundary 
layer. At the cruise condition (M = 0.80 and an 
altitude of 40,000 ft) the Reynolds number is 1.45 
million/ft. 
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The wing sweep for the design is a low 20 degrees. 
Thus, with the thick airfoil sections featured in the 
configuration, it represents a challenge to contain 
wave drag at the moderate Mach numbers of its de- 
sign point ( M = 0.75 - 0.82). Although they are 
not presented here, correlations of the wing pressure 
distributions have been obtained with experimental 
data. The comparisons with tunnel data are excel- 
lent except for a 5% difference in the location of the 
upper surface shock for the inviscid analysis. The 
Navier-Stokes solutions virtually overlay the wind 
tunnel data. 


Inviscid Transonic Flow Constrained Aircraft Design 

The Euler mesh described above was used during 
an inviscid redesign of the wing in the presence of 
the complete configuration. The baseline configu- 
ration was designed for flight at M = 0.80 with a 
Cl = 0.30. In this inviscid design case, a single point 
constrained design is attempted in which the Mach 
number and Cl are pushed to 0.82 and 0.35 respec- 
tively. The objective is to minimize configuration 
pressure drag at a fixed lift coefficient by modifying 
the wing shape. Eighteen Hicks-Henne design vari- 
ables were chosen for six wing defining sections for 
a total of 108 design variables. Spar thickness con- 
straints were also enforced at each defining station 
at x/c = 0.2 and x/c — 0.8. Maximum thickness 
was forced to be preserved at x/c = 0.4 for all six 
defining sections. Each section was also constrained 
to have the thickness preserved at x/c = 0.95 to en- 
sure an adequate included angle at the trailing edge. 

A total of 55 linear geometric constraints were im- 
posed on the configuration. Figure (2) shows over- 
lays of the C p distributions for the initial and final 
design after 6 NPSOL design iterations at four sta- 
tions along the wing. It is seen that the final result 
has reached a near-shock-free condition over much 
of the outboard wing panel. The drop in complete 
configuration pressure drag for this case was 24.6%. 
Noting that most of this drag reduction came from a 
decrease in wing wave drag implies that further im- 
provements may be possible through the reshaping 
of other components. The program took 13.5 hours 
to complete 6 design cycles on 32 processors of an 
IBM SP2. 


Viscous Transonic Flow Constrained Aircraft Design 

In a second design example for this complete busi- 
ness jet configuration, a viscous redesign of the wing 
is attempted. The design point is again chosen to 
be Mach = 0.82 with a Cl =0.35. Again the design 
algorithm, this time with a no-slip boundary condi- 
tion on the wing and the viscous terms turned on, 


is run in drag minimization mode. Figure (3) shows 
an iso-Cp colored representation of the initial design 
and the final design after 5 NPSOL design iterations. 
It is clearly seen that the rather low C p region ter- 
minated by a strong shock spanning the entire wing 
upper surface has been largely eliminated in the fi- 
nal design. Figure (4) shows the initial and final 
C p distributions achieved using the same 108 design 
variables and 55 geometric constraints employed for 
the inviscid test case. Note that the strong shocks 
present on both the upper and lower surfaces in the 
initial configuration have been eliminated. Further- 
more, it is apparent that the character of the changes 
to the pressure distribution follow those that oc- 
curred for the Euler based design to some extent. 
The main difference is that the Navier-Stokes de- 
sign tends to have a more benign behavior in the 
pressure distributions. The overall pressure drag for 
the complete configuration was reduced by 21.5%. 
Before proceeding to the next section, it should be 
noted that these business jet design examples are 
only representative of the potential for automated 
design, and are not intended to provide designs for 
actual construction. First, in each case only 5 or 6 
NPSOL steps were taken where considerably more 
could have improved the designs slightly. More im- 
portantly, since these are only single point designs, 
either may suffer unacceptable off-design behavior. 
In our most recent previous paper [49] we treated 
the case of inviscid design at multiple design points 
while here we address the case of viscous design at a 
single design point. Eventually, both the multipoint 
and viscous design capabilities must be treated con- 
currently. The calculation took 28 hours to complete 
5 design cycles on 32 processors of a IBM SP2. 


Crosscheck 

To see the difference between the two design cases 
explored here the Euler based design was reanalyzed 
using the Navier-Stokes equations. The mesh for 
this cross check was created by entering the design 
variable coefficients produced from the Euler design 
process into a set of inputs for a Navier-Stokes anal- 
ysis. The result of this reanalysis is shown in Figure 
(5). It is seen that the wing designed using the new 
Navier-Stokes approach employed here has slightly 
weaker shocks than the one designed via the use of 
the inviscid approach. This conclusion is also sup- 
ported by the fact that the drag improvement for 
the Euler design analyzed with the Navier-Stokes 
equations has a 20.5% improvement as contrasted 
with the 21.5% for the Navier-Stokes based design. 
However, it is seen from these two designs that the 
Euler strategy did not perform at all poorly and in- 
deed was able to achieve much of the improvement 
that was possible from a Navier-Stokes based design 
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method. This conclusion must be taken with caution 
since the sensitivity of the pressure distributions to 
the presence of the boundary layer can vary widely 
depending on the configuration. Furthermore, had 
we used a true viscous adjoint it may have been pos- 
sible to lower the pressure drag for the configuration 
even further. Clearly, much further testing of the 
design approaches developed here is needed. These 
calculations must be taken as the preliminary steps 
towards Navier-Stokes based aircraft design. 


Parallel Performance of the Method 

The following section presents a series of parallel 
scalability curves for the baseline code with the var- 
ious added improvements to reduce the communica- 
tion overhead. The scalability study was conducted 
for two different meshes whose ratio of computation 
vs. communication was chosen to be at both ends of 
the spectrum. The first mesh is the benchmark mesh 
referred to above. This mesh consists of 72 blocks 
of varying sizes and has a total of 734, 976 cells of 
which over 300, 000 reside in the halos. As one can 
see, the results on this mesh will provide an extreme 
test of the scalability of the method since a large 
part of the time will be spent in the communication 
process. The second mesh is referred to as the fine 
mesh and consists of 48 blocks of different sizes with 
a total of 2.5 million cells, from which about 600, 000 
reside in the halos. This mesh has a much higher 
ratio of computation to communication, and there- 
fore, the parallel scalability of the code is expected to 
improve when compared with the benchmark mesh. 
Nevertheless, the contrast provided between the re- 
sults in the two meshes is intended to be illustrative 
of the range of performance that can be expected for 
meshes of varying sizes. 

Scalability studies were conducted on a range of 
platforms whose performance characteristics are pre- 
sented in Table 4. These platforms include the 
very high performance communication network of 
the IBM SP2 (in User Space mode) and networks of 
workstations connected via standard lOBaseT eth- 
ernet as well as a higher performance (although still 
low cost) switched-lOOBaseT fast ethernet network. 
For the benchmark mesh, the range 1 — 16 was se- 
lected for the number of processors, whereas for the 
fine mesh, 4 — 32 was selected instead. This latter 
choice results from the inability to fit this large size 
calculation in a number of processors smaller than 4. 
In order to present speedup data for this mesh, the 
best achievable timing for a one processor calcula- 
tion was derived from the measurements of compu- 
tational time (not communication) observed for the 
4 processor case. It will be seen from the data that 
for this mesh the program must scale at worst lin- 
early between 1 and 4 processors and therefore our 


timing assumption is conservative. 

Figures 6-8 present the results obtained for the 
benchmark mesh using the IBM SP2 in User Space 
and Internet Protocol modes and the cluster of work- 
stations using switched-lOOBaseT for both the base- 
line and the improved load balancing algorithms. 
Since the SP2 in User Space Mode has a very high 
performance network, the impact of the use of dif- 
ferent load balancing techniques is small. For ex- 
ample, the parallel speedup for the 1-pass, single 
precision scheme using 16 processors remained vir- 
tually unchanged from 11.88 to 11.89. When used 
in IP mode (a lower performance network that very 
well simulates the results on the network of work- 
stations linked via switched-lOOBaseT) the results 
of improved load balancing schemes start to show 
up. For the same communication algorithm and the 
same mesh, the parallel speedup improved from 8.46 
to 8.70 using 16 processors. Using 8 processors on 
the switched-lOOBaseT network for the same algo- 
rithm, the parallel speedup improves from 3.212 to 
3.645. These results are even more dramatic for the 
case of the lowest performance network: unswitched 
lOBaseT (ethernet). In this case, the use of the im- 
proved load balancing scheme decreased the time to 
complete one multigrid iteration from 108.38 seconds 
to 57.46 seconds using 4 processors in the network. 

A more interesting point addressed by the results in 
these scalability plots is the increase in performance 
derived from successive improvements to the com- 
munication algorithm. Figures 6a, 7a, and 8a show 
the progression from the 3-pass and double precision 
scheme to the 1-pass and double precision, 1-pass 
and single precision, and 1-pass and single precision 
with single level halo schemes using the IBM SP2 in 
both US and IP modes and the network of switched- 
lOOBaseT workstations. These results use the orig- 
inal load balancing technique. Similar conclusions 
can be obtained from Figures 6b, 7b, and 8b which 
use the improved load balancing algorithm. For the 
US results, the parallel speedup using 16 processors 
improved from 10.41 to 13.00. For the IP results, the 
improvement is larger (as expected from the use of 
a lower performance communication network): par- 
allel scalability for the same number of processors 
improved from 7.11 to 10.69. It must be noted that 
in both cases, there is an upper limit to the parallel 
scalability achievable by the scheme derived from the 
impossibility of evenly load balancing a calculation 
in which the sizes of the different blocks in the mesh 
vary. For this case, this upper limit on parallel scala- 
bility using 16 processors is found to be 14.9, which is 
very close to the actual achieved value using the IBM 
SP2 in US mode, considering the highly communica- 
tive nature of the benchmark mesh. At the same 
time, the results for the switched-lOOBaseT network 
of workstations using 8 processors show an improve- 
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ment in parallel scalability from 2.98 to 5.41, point- 
ing out how much more important these successive 
communication improvements are for networks with 
lower performance. From these data, it is clearly 
seen that a high performance message passing im- 
plementation is essential to allow the effective use of 
networks of workstations. 

A more clear representation of the improvement 
in communication performance can be seen in Fig- 
ure 9a where the parallel scalability results using 
the IBM SP2 US and IP modes for the benchmark 
mesh have been overlaid. The lower curve shows the 
speedup obtained using the original scheme (3-pass, 
double precision) with the SP2 in US mode, whereas 
the upper curve shows the speedup of the improved 
scheme (1-pass, single precision, single level halo) 
using the SP2 in the lower performance IP commu- 
nication mode. Since the IP mode is representative 
of a switched-fast ethernet network of workstations, 
one can deduce from this graph that the improve- 
ments in communication performance introduced in 
this work make the use of networks of workstations 
for the design process entirely feasible. 

All these scalability studies were repeated for the 
fine mesh with 48 blocks and 2.5 million cells. The 
difficulties with parallel scalability exhibited by the 
previous figures are substantially ameliorated. In 
the interest of space, only a summary is presented in 
Figure 9b where the results using up to 32 processors 
are presented for the IBM SP2 in US mode and the 
different variations in communication algorithms. It 
can be seen that the communication improvements 
shift the scalability curve slightly upwards. This 
shift is small since for numbers of processors up to 
16 the ratio of communication time to processing 
time is very small. Note that some of the calcula- 
tions exhibit superlinear speedup for some numbers 
of processors; this effect has been observed previ- 
ously [12] and is attributed to a better utilization of 
the processor cache resulting from a decrease in size 
of the datasets that the processor operates on. The 
decrease in performance for the 32 processor results 
is a consequence of the poor load balancing that can 
be obtained with a mesh that only has 48 blocks. 

Although not reported in any of the figures, 
scalability studies were performed for the base- 
line unswitched ethernet network of workstations 
(lOBaseT). This communication fabric has the lim- 
itation that the total bandwidth of the system re- 
mains constant, independent of the number of pro- 
cessors in the calculation. This lack of bandwidth 
scalability results from the very nature of ethernet 
communication; the network can be used by only 
one processor at a time. Table 7 shows the re- 
sults for the parallel speedups obtained using the 
baseline algorithm (original load balancing, 3-pass, 


No. Processors 

Baseline 

Optimized 

1 

1.000 

1.000 

2 

0.768 

1.552 

4 

0.688 

2.301 

8 


1.493 


Table 7: Parallel Speedups for the Baseline and Op- 
timized Communication Schemes Using unswitched 
Ethernet (lOBaseT) 

double precision) and the improved communication 
(new load balancing, 1-pass, single precision, sin- 
gle level halo). The disappointing performance of 
this network clearly points out that typical ethernet 
networks are not suitable for parallel computations 
such as these ones which place a high demand on 
the communication layer. Despite this negative re- 
sult, it is apparent that the new communication al- 
gorithms dramatically improved the performance of 
the method. 

CONCLUSIONS 

A general aerodynamic shape optimization method 
has been developed and demonstrated for the case 
of Euler and Navier-Stokes automatic redesign of 
a complete aircraft configurations in transonic flow. 
The design method is implemented on distributed 
memory systems (including parallel computers and 
distributed networks of workstations) using the MPI 
Standard and it achieves excellent scalability in all 
platforms except for the simple unswitched- Ethernet 
case (lOBaseT). For a full configuration viscous 
mesh containing 5.8 million cells and 240 blocks, a 
complete design including a total of 5 design itera- 
tions can be completed in 28 hours using 32 proces- 
sors of an IBM SP2 system. The present method 
uses the Navier-Stokes equations for the solution of 
the flow and an inviscid adjoint solver for the calcu- 
lation of the gradient information. Further research 
and development work is required to assess the ap- 
plicability and usefulness of a viscous adjoint tech- 
nique. 

With the present method, the automatic aerody- 
namic design of complex aircraft configurations us- 
ing a high fidelity viscous flow model based on the 
RANS equations becomes feasible on the current 
generation of parallel computers. Moreover, the ap- 
plicability of the method is demonstrated for use in 
networks of workstations with a moderate invest- 
ment in networking resources (switched-lOOBaseT). 
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Eukr solution 


Euler solution 



I 1 1 1 1 1 1 1 1 1 \ 

0.00 0.10 0.20 0J0 0.40 0J0 0-60 0.70 0-t0 0.90 1-00 

x/c 


la: span station z = 0.283 



0.00 0.10 0.20 0.30 0.40 OJO 0.60 0.70 O.SO 0.90 1.00 

x/c 

lb: span station z = 0.472 



lc: span station 2 = 0.660 Id: span station z — 0.849 


Figure 1: Business Jet Configuration Configuration. Comparison of Baseline Solutions. 
M = 0.82, C L = 0.35 

, Euler Solution Pressure Distribution. 

— , Navier-Stokes Solution Pressure Distribution. 






[— — 1 I 1 1 1 1 1 1 1 — 1 

0.00 0.10 0.20 0.30 0.40 0.50 0-60 a 70 0.80 0.90 1.00 

me 


2c: span station z = 0.660 



i r i 1 1 1 1 1 1 1 1 

0.00 0.10 0.20 0 J 0 0.40 OjO 0.60 0.70 0.80 0.90 1.00 

me 

2d: span station 2 — 0.849 


Figure 2: Business Jet Configuration. Euler Based Drag Minimization at Fixed Lift. 
M = 0.82, C L = 0.35 

108 Hicks-Henne variables. Spar Constraints Active. 

, Initial Pressures 

— , Pressures After 6 Design Cycles. 
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Figure 3t Transonic Business Jet Configuration 
Iso-Cp Contours, Navier-Stokes. 
Baseline and Optimized Designs. 

M = 0.82, C L = 0.35 
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_ . NS solution 


— NS solution 



4a: span station z - 0.283 4b: span station z - 0.472 



I I 1 \ 1 1 1 1 1 \ 1 

0.00 0.10 0.20 0J0 0.40 0.50 0.60 0.70 0.S0 0.90 IDO 

x/c 

4c: span station z — 0.660 



Figure 4: Business Jet Configuration. Navier-Stokes Based Drag Minimization at Fixed Lift. 
M = 0.82, C L = 0.35 

108 Hicks-Henne variables. Spar Constraints Active. 

, Initial Pressures 

— , Pressures After 5 Design Cycles. 






I 1 J \ [ 1 1 1 [ 1 1 

0.00 0.10 0.20 OJO 0.40 OJO 0.60 0.70 0.10 0.90 1.00 

x/c 

5a: span station z = 0.283 




f 1 1 1 J 1 1 1 1 1 1 

0.00 0.10 0.20 0.30 0.40 0.S0 0.60 0.70 O.SO 0.90 1.00 

x/c 


5b: span station z = 0.472 



5c: span station z = 0.660 5d: span station z = 0.849 


Figure 5: Business Jet Configuration. Comparison of Euler and Navier-Stokes Designs. 
M = 0.82, C L = 0.35 

108 Hicks-Henne variables. Spar Constraints Active. 

, Navier-Stokes Pressures for Euler Based Design. 

— , Navier-Stokes Pressures for Navier-Stokes Baaed Design. 
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Parallel Speedup Parallel Speedup 


Parallel Speedup - FLO107-MB - 72 Block Mesh - IBM SP2 User Space 



Number of Processors 


Parallel Speedup - FLO107-MB - 72 Block Mesh - IBM SP2 User Space 



Number of Processors 


6a: Original Load Balancing Scheme. 


6b: Improved Load Balancing Scheme. 


Figure 6: Parallel Speedup for Benchmark Mesh Using the IBM SP2 in User Space Mode 


Parallel Speedup - FIO107-MB - 72 Block Mesh - IBM SP2 Inte met Protocol 



Number of Processors 


Parallel Speedup - FLO107-MB - 72 Block Mesh - IBM SP2 Inte met Protocol 



Number of Processors 


7a: Original Load Balancing Scheme. 


7b: Improved Load Balancing Scheme. 


Figure 7: Parallel Speedup for Benchmark Mesh Using the IBM SP2 in Internet Protocol Mode 
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Parallel Speedup - FLO107-MB- 72 Block Mesh - HP Switched- 100BaseT Network Parallel Speedup - H.O107-MB- 72 Block Mesh - HP Switched- 100BaseT Network 



Figure 8: Parallel Speedup for Benchmark Mesh Using the HP Cluster with Switched-lOOBaseT Fast 
Ethernet 


Parallel Speedup - FLO107-MB - 72 Block Mesh Parallel Speedup - FLO107-MB - 40 Block Mesh - IBM SP2 User Space 



9a: Comparison Between IBM SP2 US and IP Modes 
for the Benchmark Mesh. 


9b: Parallel Speedups for Fine Mesh. 


Figure 9: Parallel Speedup Comparison Results for Benchmark and Fine Meshes 
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